(1)  Foreword

This report details the scientific progress of the PARC team in the GLAD-PC project for the period 08/01/2013 - 07/31/2014. This
research has been funded by the DARPA/ADAMS program under contract W911NF-11-C-0216. Any opinions, findings, and
conclusions or recommendations in this report and associated material are those of the authors and do not necessarily reflect
the views of the government funding agency.

(2)  Table  of  Contents  (None) 

(3)  List  of  Appendixes: 

•  Detecting  insider  threat  from  enterprise  social  and  online  activity  data 

•  Temporally  Coherent  Role-Topic  Models  (TCRTM):  deinterlacing  overlapping  activity  patterns 

•  Detecting  employee  churn  from  enterprise  social  and  online  activity  data 

•  PARC-ADAMS-PI-Meeting-20150305_v4

(4)  Statement  of  the  problems  studied 

The PARC team investigated three approaches to detecting aspects of malicious insider activity: a) psychological profiling from
email; b) quitting dynamics and quitting prediction from corporate social media data; and c) detecting unusual and anomalous
behavior from on-line activities.

(5)  Summary  of  the  most  important  results 

With regard to (a), psychological profiling from email: we defined a Bayesian model for the motivations and psychology of
the malicious insider and an associated degree of interest. We then aimed to predict the derived psychological variables
automatically from text in emails. Several large studies were conducted involving over 1000 subjects. We measured the
subjects' psychology using surveys and collected anonymized features from their email communications. We were able to
predict the subjects' psychological variables with up to 95% accuracy (see [Shen1]). The constructed predictors have been
applied to various real-world data sets, including large corporate email data sets. The results have been made accessible to
analysts via a dedicated personality prediction visualization called the Interactive Personality Workbench (described in last year's
AUG 2013 - JUL 2014 Interim Report). Initial feedback from the analysts has been very positive.

With regard to (b), quitting dynamics and quitting prediction from corporate social media data: last year, we looked into
predicting if and when people quit a corporation using their activity on an internal social media network called Yammer. We had
access to a data set of over 24,000 users of this internal social media network at a large corporation, including over
2,000 groups and over 150,000 public messages. The goal was to predict, at any given time instance, whether an employee is likely to
quit the company. We identified 298 quitter instances among 7000 non-quitter instances (after
cleaning and filtering the data set according to appropriate parameters, e.g. number of messages and activity scores). Using a
random forest and a balanced data set (50% baseline), we obtained an accuracy of 68%, a relative improvement of 36%
over the baseline. A detailed summary of the results, including figures and tables, can be found in [Gavai1].
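
The 36% figure reads as a relative gain over the 50% chance level rather than a difference in percentage points. A minimal sketch of that arithmetic follows; the function name is illustrative and does not come from the report.

```python
def relative_improvement(accuracy: float, baseline: float) -> float:
    """Relative gain of `accuracy` over `baseline`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (accuracy - baseline) / baseline


# Balanced data set: chance level is 50%, the random forest reaches 68%.
gain = relative_improvement(0.68, 0.50)
print(f"{gain:.0%}")  # prints 36%
```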

During this last year we extended this work on quitting dynamics by studying employee churn behavior. Employee churn is a
significant concern for organizations, with downsides including loss of talent, loss of productivity, and security risk, given that
employees are likely to retain confidential company data after they quit. PARC hypothesized that precursors to an
employee quitting a company will manifest in the enterprise social and online activity data of the employee. To this end, we
processed and extracted relevant features from social data, including email communication patterns and content, and online
activity data such as web browsing patterns, email frequency, and file and machine access patterns, and used these features to
build a predictive model for detecting employee quitting events ahead of time. We tested our predictive models on two different
real-world data sets, and our experiments show that we are able to detect quitting events with moderately high accuracy.

Finally, we built a visualization dashboard that enables managers and HR personnel to quickly identify employees with high
quitting scores, which will enable them to take suitable preventive measures to reduce churn [Sricharan2, attached].

Regarding (c), detecting unusual and anomalous behavior from on-line activities, PARC investigated techniques to discover
insider threat in organizations by identifying abnormal behavior in enterprise social and online activity data of employees. To
this end, we processed and extracted relevant features that were possibly indicative of insider threat behavior. This includes
features extracted from social data, including email communication patterns and content, and online activity data such as web
browsing patterns, email frequency, and file and machine access patterns. Subsequently, we detected statistically abnormal
behavior with respect to these features using state-of-the-art anomaly detection methods, and treated this abnormal behavior
as a proxy for insider threat activity. We tested our approach on a real-world data set (the Vegas data set from ADAMS) with
artificially injected insider threat events. Our experiments show that our proposed approach is fairly successful in identifying
insider threat events. Finally, we built a visualization dashboard that enables managers and HR personnel to quickly identify
employees with high threat risk scores, which will enable them to take suitable preventive measures and limit security risk
[Sricharan1, attached].


PARC  also  investigated  the  specific  problem  of  identifying  overlapping  activity  patterns  in  the  VEGAS  data  set.  The  Temporally 
Coherent  Role-Topic  Model  (TCRTM)  is  a  probabilistic  graphical  model  for  analyzing  overlapping,  loosely  temporally  structured 
activities  in  heterogeneous  populations.  Such  loose  temporal  structure  appears  in  many  domains,  but  especially  in  the  ADAMS 
data,  where  individual  events  that  make  up  an  activity  have  coherence,  but  no  strong  temporal  ordering.  For  instance,  preparing 
a  PowerPoint  presentation  may  involve  opening  files,  typing  text,  downloading  images,  and  saving  files.  These  activities  occur 
together  in  time,  but  without  a  strong  ordering  or  fixed  duration.  These  temporally  coherent  activities  may  also  overlap  -  the 
user  might  also  be  responding  to  email  while  working  on  the  presentation.  Finally,  the  population  of  users  has  subgroups  -  in 
the  office,  administrators,  salespeople  and  engineers  will  have  different  activity  distributions.  The  unique  architecture  of  the 
TCRTM  model  allows  it  to  automatically  infer  an  appropriate  set  of  roles  and  activity  types  while  simultaneously  assigning  users 
to  these  roles  and  segmenting  their  event  streams  into  high-level  activity  instance  descriptions.  On  two  real-world  datasets 
taken  from  computer  user  monitoring  and  social  services  debit  card  transactions  we  show  that  TCRTM  extracts  semantically 
meaningful  structure  and  improves  perplexity  score  on  hold-out  data  by  a  factor  of  five  compared  to  standard  models  such  as 
LDA [Bart1, attached].
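
For reference, held-out perplexity is commonly computed as the exponential of the negative mean log-likelihood per event, so lower values are better. The sketch below uses that standard definition; the log-likelihood values are made-up placeholders chosen only to illustrate a factor-of-five ratio and are not results from the report.

```python
import math
from typing import Sequence


def perplexity(log_likelihoods: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log likelihood per held-out event)."""
    if not log_likelihoods:
        raise ValueError("need at least one held-out event")
    return math.exp(-sum(log_likelihoods) / len(log_likelihoods))


# Hypothetical per-event log-likelihoods under two models (placeholders only).
baseline_ll = [math.log(0.01)] * 4  # each event assigned probability 0.01
improved_ll = [math.log(0.05)] * 4  # each event assigned probability 0.05

ratio = perplexity(baseline_ll) / perplexity(improved_ll)
print(round(ratio, 6))  # ratio of baseline to improved perplexity
```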

All of these results, together with a summary of PARC's work on ADAMS, were presented at the final ADAMS PI Meeting, held at DARPA in
March 2015 (the briefing slides are attached).

Technology  Transfer


See  attachment:  Technology  transition  from  PARC 
Personality  and  Anomalous  Behavior


•  Organizations (and society) face an increasing
number of threats from “inside” and
“outside”

•  Challenge:  Uncover  malicious  behavior  in  a 
timely  way  through  automatic  analysis 

•  Anomalous  behavior  trace  often  precedes 
the  actual  “incident” 

•  Personality has been shown to be a reliable
indicator for future (malicious) behavior
(Jensen and Patel, 2011)

•  50%  of  job  quitters  steal  confidential 
company  data 
ADAMS  GLAD-PC


•  PARC  project: 

-  Graph  Learning  and 
Anomaly  Detection 
using  Psychological 
Context  (GLAD-PC) 

-  Idea:  combine  graph 
learning  /  structural 
anomaly  detection  and 
psychological  modeling 


Previous  Research:  Personality  Profiling 
for  Malicious  Insider  Detection 


•  We are interested in psychological profiles as
indicators for future malicious behavior

•  Why?

•  Counterproductive  (cyber-)behaviors  have  been  shown  to  be 
highly  correlated  with  Big-5  personality  variables  [1] 

•  Actual  insider  threats  have  a  low  base  rate  ->  psychological 
profiles  are  a  powerful  filter  to  reduce  false  positives 


[1] Jaclyn M. Jensen, Pankaj C. Patel, Predicting counterproductive work behavior from the interaction of
personality traits, Personality and Individual Differences 51(4):466-471, Sept. 2011.
What  is  a  Personality  Profile?


Personality  Variables 

Neuroticism 

Agreeableness 

Conscientiousness 

Excitement  Seeking 

Hostility 

Extraversion 

Self-Assurance 

Overall  Mood/Emotion 

Organizational  Deviance 

Personal  Deviance 

Perceived  Stress 

Degree  of 
Interest 


Approach 


•  Idea:  automatically  estimate  personality  from  emails 


Pipeline components:

-  Organizational email

-  “Feature” anonymization

-  Large training data corpus (MechTurk, company-internal)

-  Machine Learning (survival model, SVM classifier)

-  Personality Estimators

-  Viz: Interactive Personality Workbench


Results 


•  Data  Collection  &  Evaluation  Results  (of  Estimators): 


-  Over 1000 personality profiles + emails collected from MechTurk and
company-internal sources (for training the estimators)
Interactive  Personality  Visualization
Previous  Research:  Quitting  and
Destructive  Group  Dynamics 


•  We  are  interested  in  quitting  behavior  & 
destructive  group  dynamics 

•  Proxy  of  malicious  behavior:  “50%  job  leavers  steal 
confidential  company  data” 

•  Questions: 

-  Can we observe quitting behavior and destructive
group dynamics in real-world and social space?

-  How is real-world behavior related to social space data?

-  Can we predict real-world behavior?
Previous  Research  on  Quitting  Behavior


Online  Games 


Startup  Venture 


Yammer 


Destructive  group 

dynamics 

•  if/when  a  player  will 
quit  a  guild 

•  damage  associated 
with  a  quit  event 

•  guild  stability  against 
member  loss 


Churn 

prediction  in 
a  real-world 
corporation 


Predict  quitting  based  on 
work  practice,  email,  and 


content. 
2014  Summer  Intern


Literature Search

•  Voluntary Turnover (Quitting)

•  Computer-Mediated Communication in Workplace

•  Affect and Emotional Contagion in Workplace

-  Yiran Wang


Structured  Interviews 


•  Recent  quitters 

•  N=12  (Male  =  9,  Female  =  3) 

•  Job  titles  include: 

•  research  scientist  (2) 

•  software  engineer  (3) 

•  research  engineer  (2) 

•  director/manager  (1) 

•  senior associate in a bank (1)

•  system engineer (1)

•  manufacturing engineer (1)

•  office  manager  (1) 
Interview  Results


•  Web Browsing

-  Increased use of career
sites (e.g., LinkedIn)

-  Increased browsing of
company profiles

•  Personal  Email 

-  Increased  use  of  personal 
email  for  job  applications 

•  Work  Email 

-  No  conscious  change 

-  Some  made  an  effort  to 
maintain  normal  email 
behavior 


•  Work Routine

-  Shortened work hours and
more time off to
accommodate interviews

•  Multitasking

-  Shortened attention spans
at work

-  More task switching

•  Engagement

-  Decreasing engagement in
general

-  More neutral sentiment in
emails
Feature  Selection/Engineering


•  Extract  a  rich  set  of  features: 

•  Email  Usage  (-sent  count) 

•  Email  Content  (-subject  char  length) 

•  Log  On  /  Log  Off  Statistics 

•  Application  Activity  (+max  time  spent  on 
activity,  +  #  of  activity  types  per  day) 

•  Web  Usage  (time  on  -internal/+job  sites) 

•  Feature  matrix  F:  U  x  T  x  D 
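
One way to picture the feature matrix F is as a three-way array indexed by user (U), time step (T), and feature dimension (D). The sketch below fills such an array from flat records using plain Python lists. The record layout, the weekly time step, and all names are assumptions made for illustration, not the project's actual schema.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, int, str, float]  # (user, week index, feature name, value)


def build_feature_tensor(
    records: Sequence[Record], users: Sequence[str], n_weeks: int, features: Sequence[str]
) -> List[List[List[float]]]:
    """Return F with F[u][t][d] = value of feature d for user u in week t (0.0 if unseen)."""
    u_idx = {u: i for i, u in enumerate(users)}
    d_idx = {f: i for i, f in enumerate(features)}
    F = [[[0.0] * len(features) for _ in range(n_weeks)] for _ in users]
    for user, week, feature, value in records:
        # Repeated records for the same cell are summed.
        F[u_idx[user]][week][d_idx[feature]] += value
    return F


# Hypothetical records (placeholders only).
records = [
    ("u1", 0, "email_sent_count", 12.0),
    ("u1", 1, "email_sent_count", 7.0),
    ("u2", 1, "time_on_job_sites", 1.5),
]
F = build_feature_tensor(records, ["u1", "u2"], 2, ["email_sent_count", "time_on_job_sites"])
print(len(F), len(F[0]), len(F[0][0]))  # U, T, D
```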
This  Year’s  Problem  Set-up


Vegas Database -> Features (Email Usage, Email Content, Logon/Logoff, App. Activity, Web Usage)

Features + Quitting Examples -> Quitting Classifier -> Quitting Prediction

Features -> Anomaly Detector -> Anomaly Prediction
Problem  set-up


•  Twin  approaches: 

•  Supervised  -  Use  quitting  labels  as  proxy 

•  Build  classifier  to  predict  quitters  and 
corresponding  time  instances 

•  Unsupervised  -  Use  anomaly  detection 
methods  to  detect  abnormal  behavior 
Vegas  Dataset


•  Multi-Domain  Employee  Data 

•  Anonymized  application-wise  log  of  User 
activity 

•  Anonymized  activity  log  of  user  interactions  with 
different  agents 

•  Email  interaction  data  between  business  unit 
users 

•  Aggregated  statistics  on  Email  content  data 

•  Snapshots  of  LDAP  hierarchy 

•  Day-to-day  LDAP  diffs
Vegas  Dataset


Dataset Statistics

Date range     2013-10-01 to 2014-07-01 (8 months)
Users          6805 users
Dataset size   ~1 billion user activity records
Domains        Email Usage, Email Content, Logon/Logoff, Application Usage, Web Usage
Target users   555 Quitters (1270 Pseudo); 104 Red Team Users
Feature  Extraction


•  Calculated  aggregate 
features  from  raw  data 

•  Constructed  features  in  5 
different  domains 

•  Features  developed  from 
earlier  Yammer  work  were 
supplemented  with  newer 
ones  derived  from  insights 
gained  by  conducting 
interviews  with  employees 
that  quit  their  jobs 


Email  Usage  Features 

Weekly  sent  count 

Weekly  read  count 

Number  of  messages  sent  in  the  day 

Number  of  messages  sent  at  night 

Number  of  messages  read  in  the  day 

Number  of  messages  read  at  night 

Email  Content  Features 

Average  subject  word  length 

Average  subject  character  length 

Average  content  character  length 

Average  content  word  length 

Average  content  sent  length 

Number  of  exclamation  points 

Number  of  multiple  exclamation  points 

Number  of  question  marks 

Number  of  multiple  question  marks 

Number  of  brackets 

Number  of  dashes 

Number  of  double  dashes 

Number  of  ellipses 

Number  of  commas 

Number  of  semicolons 

Number  of  colons 
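
As an illustration of how the email content features listed above might be computed for a single message, here is a minimal sketch. The function name, the dictionary keys, and the reading of "multiple" as a run of two or more identical marks are assumptions; the averaged features on the slide would then be means of such per-message values over a time window.

```python
import re
from typing import Dict


def email_content_features(subject: str, body: str) -> Dict[str, int]:
    """Compute a few length and punctuation features for one message."""
    return {
        "subject_char_length": len(subject),
        "subject_word_length": len(subject.split()),
        "content_char_length": len(body),
        "content_word_length": len(body.split()),
        "exclamation_points": body.count("!"),
        # Assumption: "multiple" means a run of two or more identical marks.
        "multiple_exclamation_points": len(re.findall(r"!{2,}", body)),
        "question_marks": body.count("?"),
        "multiple_question_marks": len(re.findall(r"\?{2,}", body)),
        # Exactly two dashes, not part of a longer run.
        "double_dashes": len(re.findall(r"(?<!-)--(?!-)", body)),
        "ellipses": len(re.findall(r"\.{3,}", body)),
        "commas": body.count(","),
        "semicolons": body.count(";"),
        "colons": body.count(":"),
    }


features = email_content_features("Quick question", "Are you free?? Let me know... Thanks!!")
print(features["multiple_question_marks"], features["ellipses"])
```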


Logon  Logoff  Features 


Number  of  logons 

Number  of  logoffs 

Number  of  hours  with  logon  activity 

Number  of  hours  with  logoff  activity 


Activity  Features 


Number  of  activity  types 

Max  contiguous  time  spent  on  activity 

Number  of  activities 

Time spent on email applications

Time spent on productivity applications

Time spent on web applications

Time spent on engineering applications


Web  Usage  Features 


Time spent on websites

Time spent on career sites

Time spent on web mail sites

Time spent on entertainment sites

Time spent on internal SM sites

Time spent on internal sites

Time spent on news sites

Time spent on private social media sites

Time spent on search sites

Time spent on tech sites
Hierarchy  Creation


•  Needed  a  hierarchy  of  the  organization  to  be  able 
to  compare  the  behavior  of  a  user  with  their  peers 

•  Data  available:  daily  snapshots  of  LDAP  hierarchy 

•  We  created  a  normalized  hierarchy  by  finding  the 
most  persistent  relationships  between  supervisors 
and  employees  over  the  time  period  in 
consideration 

•  Resulted in ~200 sub-trees due to the business
unit not containing the higher levels of the
hierarchy
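
The normalization step above can be sketched as follows, assuming that "most persistent" means the supervisor who appears for an employee on the largest number of daily snapshots. The snapshot layout and the names are hypothetical; the slides do not specify how persistence was measured.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Snapshot = Iterable[Tuple[str, str]]  # (employee, supervisor) pairs for one day


def most_persistent_supervisors(snapshots: Iterable[Snapshot]) -> Dict[str, str]:
    """Map each employee to the supervisor seen on the most daily snapshots."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for snapshot in snapshots:
        for employee, supervisor in snapshot:
            counts[employee][supervisor] += 1
    # Counter.most_common orders equal counts by first occurrence, so ties go to the earliest supervisor.
    return {emp: c.most_common(1)[0][0] for emp, c in counts.items()}


# Hypothetical three-day history (placeholders only).
daily = [
    [("alice", "carol"), ("bob", "carol")],
    [("alice", "carol"), ("bob", "dave")],
    [("alice", "erin"), ("bob", "dave")],
]
hierarchy = most_persistent_supervisors(daily)
print(hierarchy)
```

Employees whose most persistent supervisor is missing from the data become roots of their own sub-trees, which is consistent with the many sub-trees reported above.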
Hierarchies


•  Examples of sub-trees

•  "ext" nodes are external to the business unit

Node labels shown in the figure: 688B6B, 94ECE2, 864145, 886400, B08F2A, ext758
Supervised  approach

Quitting  Detection 


Problem  statement:  At  any  given  time,  predict  if  an 
employee  is  likely  to  quit  the  company: 

•  Restrict  attention  to  (User  U,  Time  T)  tuples  such  that 
user  U  has  data  for  at  least  1  month  leading  up  to  time  T 

•  0.6M such instances in total; 2K / 0.6M (~ 0.5%) of them
are positive, i.e. user U quit at time T, T-1 or T-2

•  Subsample  to  deal  with  class-imbalance  problem 
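
A minimal sketch of the instance construction and subsampling described above. It assumes a discrete weekly time step (so four steps stand in for one month of history), and the function names and data layout are hypothetical. Balancing by keeping all positives plus an equal number of random negatives is one common choice; the slides do not state the exact scheme used.

```python
import random
from typing import Dict, List, Tuple

Instance = Tuple[str, int, int]  # (user, time step T, label)


def label_instances(active: Dict[str, range], quit_time: Dict[str, int], min_history: int) -> List[Instance]:
    """Build (user, T, label) tuples; label is 1 if the user quit at T, T-1 or T-2."""
    instances = []
    for user, steps in active.items():
        for t in steps:
            if t - steps.start < min_history:
                continue  # not enough history leading up to T
            q = quit_time.get(user)
            label = 1 if q is not None and t - 2 <= q <= t else 0
            instances.append((user, t, label))
    return instances


def subsample_balanced(instances: List[Instance], seed: int = 0) -> List[Instance]:
    """Keep all positives and an equal-sized random sample of negatives."""
    pos = [i for i in instances if i[2] == 1]
    neg = [i for i in instances if i[2] == 0]
    rng = random.Random(seed)
    return pos + rng.sample(neg, min(len(pos), len(neg)))


# Hypothetical example: two users observed for 12 weeks, one of whom quits in week 9.
instances = label_instances({"u1": range(0, 12), "u2": range(0, 12)}, {"u1": 9}, min_history=4)
balanced = subsample_balanced(instances)
print(len(instances), len(balanced))
```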
Supervised  approach

Quitting  Detection 


•  Accuracy  =  73%  using  Random  forests 

(46%  improvement  compared  to  random  baseline) 

•  Content  features  are  most  predictive  for  quitters 
and  pseudo-quitters 


•  Confusion Matrix:

Class      +        -
+          0.746    0.254
-          0.310    0.690
Quitting  Visualization  Dashboard


ADAMS Dashboard screenshot: Quitting Risk and Anomaly Detection Features (4,524 out of 4,524 records shown); plotted measures: Anomaly Measure, Quitting Risk
Supervised  approach

Quitting  Detection  -  Insight 


•  Quitting  scores  tend  to  peak  ~2  weeks  before 
quitting 



Unsupervised  approach 

Quitting  Detection 


Detect  anomalies  with  respect  to  two  aspects: 

•  Detect  if  user  is  anomalous  with  respect  to 
rest  of  the  employees  at  each  time  instance 

•  Detect  if  user’s  behavior  has  changed 
drastically  over  time 

•  Idea:  In  addition  to  features  F,  also 
construct  differences 

dF  =  F[:,T+1,:]  -  F[:,T, :] 

•  Run  Forest  on  joint  matrix  [F;dF] 
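
The difference features can be sketched with plain nested lists, where F[u][t][d] holds feature d of user u at time step t. Pairing each feature vector with its change since the previous step is an assumption about how the joint matrix [F;dF] is laid out, and the anomaly detector itself is not shown.

```python
from typing import List

Tensor = List[List[List[float]]]  # F[u][t][d]


def temporal_differences(F: Tensor) -> Tensor:
    """dF[u][t][d] = F[u][t+1][d] - F[u][t][d]; one fewer time step than F."""
    return [
        [[nxt - cur for cur, nxt in zip(row_t, row_t1)] for row_t, row_t1 in zip(user, user[1:])]
        for user in F
    ]


def joint_features(F: Tensor) -> List[List[float]]:
    """One row per (user, t >= 1): current features followed by the change since t-1."""
    dF = temporal_differences(F)
    rows = []
    for user_F, user_dF in zip(F, dF):
        for t in range(1, len(user_F)):
            rows.append(user_F[t] + user_dF[t - 1])
    return rows


# Hypothetical tensor: one user, three time steps, two features.
F_example = [[[1.0, 2.0], [3.0, 2.0], [3.0, 5.0]]]
print(joint_features(F_example))
```

Each resulting row could then be scored by the anomaly detector of choice, so that a user can stand out either relative to peers (the F part) or relative to their own recent behavior (the dF part).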
Unsupervised  approach

Quitting  Detection


Figure panels: Ranks of Red Team Users; Ranks of Non Red Team Users
Unsupervised  approach

Quitting  Detection


Figure: Ranks of Red Team Users (x-axis: Rank, 0 to 5000)

Can identify 46% of red-team events by tracking the top 15% of users every week

85% by tracking the top 35%
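
The coverage figures above correspond to the share of red-team events whose user falls within the top fraction of the weekly anomaly ranking. A minimal sketch of that calculation follows; the ranks and the user count of 5000 are made-up placeholders, not the project's data.

```python
from typing import Sequence


def fraction_in_top(ranks: Sequence[int], n_users: int, top_fraction: float) -> float:
    """Share of flagged events whose user rank (1 = most anomalous) is within the top fraction."""
    if not ranks:
        raise ValueError("need at least one rank")
    cutoff = top_fraction * n_users
    return sum(1 for r in ranks if r <= cutoff) / len(ranks)


# Hypothetical weekly ranks of users involved in injected events (placeholders only).
example_ranks = [12, 140, 700, 950, 2600]
print(fraction_in_top(example_ranks, n_users=5000, top_fraction=0.15))  # 3 of 5 events
print(fraction_in_top(example_ranks, n_users=5000, top_fraction=0.35))  # 4 of 5 events
```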
Anomaly  Visualization  Dashboard


ADAMS  Dashboard 
Conclusion  and  Future  Work


•  End-user  activity  can  be  used  to  determine 
suspect  insider  threat  behavior 

•  False-alarms  fairly  significant,  due  to 

•  Rarity  of  abnormal  events 

•  Statistical  anomalies  that  do  not  translate  to 
real  world 
Conclusion  and  Future  Work


•  Further  research  needed  to  bring  down  false 
alarm  rate 

•  Integration  of  external  data  sources 

•  Integration  of  psychological  modeling 

•  Incorporating  analyst  feedback  to  select 
features 
Thank  you!
Questions? 


